Wine Quality Reds Exploration by HanByul Yang

Summary of the data set

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
  1. The data set consists of 1599 red wines with 11 input attributes and 1 output attribute.
  2. Quality varies from 3 to 8, with a median of 6.
  3. Alcohol varies from 8.4% to 14.9%.
  4. The median quality of red wine is 6, the median residual.sugar is 2.2 g/dm^3, and the median alcohol is 10.2%.

Univariate Plots Section

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Most red wines have fixed acidity between 7.10 g/dm^3 and 9.20 g/dm^3.
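
The stat_bin message above says the bin width defaulted to range/30. As a rough sketch using only the minimum and maximum from the summary table (the variable names are my own, not from the original code), the default for fixed.acidity works out as follows:

```r
# Minimum and maximum of fixed.acidity, taken from the summary table.
fixed_acidity_range <- c(min = 4.6, max = 15.9)

# Default described in the message: (max - min) / 30.
default_binwidth <- unname(diff(fixed_acidity_range)) / 30
print(round(default_binwidth, 3))
```

A rounder value close to this could then be passed as binwidth to make the bins easier to read and to silence the message.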

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Most red wines have volatile acidity between 0.39 g/dm^3 and 0.64 g/dm^3. There are some outliers above 1.5 g/dm^3.
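
One common way to flag outliers, not necessarily the one used for the plot above, is Tukey's rule of 1.5 times the interquartile range. A minimal sketch using the volatile.acidity quartiles from the summary table:

```r
# Quartiles of volatile.acidity from the summary table.
q1 <- 0.39
q3 <- 0.64

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged by Tukey's rule
lower_fence <- q1 - 1.5 * iqr   # values below this are flagged by Tukey's rule

print(c(lower = lower_fence, upper = upper_fence))
```

Under this rule the maximum of 1.58 g/dm^3 lies well beyond the upper fence.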

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## 
## FALSE  TRUE 
##  1467   132
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.49 0.24 0.02 0.26  0.1 0.01 0.08 0.21 0.32 0.03 0.09  0.3 0.31 0.04 
##  132   68   51   50   38   35   33   33   33   32   30   30   30   30   29 
##  0.4 0.42 0.39 0.12 0.22 0.25  0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18 
##   29   29   28   27   27   27   25   25   25   24   24   23   23   22   22 
## 0.45 0.14 0.19 0.29 0.05 0.27 0.36  0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52 
##   22   21   21   21   20   20   20   20   19   19   19   19   18   18   17 
## 0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57 
##   16   16   15   15   14   14   14   13   13   13   12   11   10    9    9 
## 0.58  0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67  0.7 0.62 0.71 
##    9    9    9    8    8    7    4    4    3    3    2    2    2    1    1 
## 0.72 0.75 0.78 0.79    1 
##    1    1    1    1    1

The citric acid distribution has three peaks around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid. There is an outlier at 1.0 g/dm^3.
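
As a quick check on the size of the peak at zero, the share of wines with no citric acid can be computed from the counts in the tables above. The variable names are only illustrative:

```r
n_wines <- 1599        # rows in the data set
n_zero_citric <- 132   # wines with citric.acid == 0, from the tables above

# Share of wines with no citric acid, as a percentage.
share_zero <- 100 * n_zero_citric / n_wines
print(round(share_zero, 1))
```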

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The histogram of residual sugar has one peak and a long tail. Most red wines have residual sugar between 1.9 g/dm^3 and 2.6 g/dm^3: median 2.2 g/dm^3 and mean 2.539 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3: median 0.079 g/dm^3 and mean 0.08747 g/dm^3. With the x axis transformed by log10, the histogram of chlorides looks approximately normal.
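
To illustrate why the log10 scale helps, the sketch below compares the length of the upper tail with the length of the lower tail around the median, before and after the transform. It uses only the summary values above, and tail_ratio is a hypothetical helper, so it is a rough indication of symmetry rather than a test of normality:

```r
# Summary values of chlorides from the output above.
chlorides_summary <- c(min = 0.012, q1 = 0.07, median = 0.079, q3 = 0.09, max = 0.611)

# Hypothetical helper: upper-tail length divided by lower-tail length.
tail_ratio <- function(x) {
  (x[["max"]] - x[["median"]]) / (x[["median"]] - x[["min"]])
}

raw_ratio <- tail_ratio(chlorides_summary)
log_ratio <- tail_ratio(log10(chlorides_summary))
print(round(c(raw = raw_ratio, log10 = log_ratio), 2))
```

On the raw scale the upper tail is several times longer than the lower tail, while on the log10 scale the two are of similar length.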

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Most free.sulfur.dioxide values are integers and most of them are between 7 mg/dm^3 and 21 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

All total.sulfur.dioxide values are integers. Most red wines have a total.sulfur.dioxide between 22 mg/dm^3 and 62 mg/dm^3. There are some outliers above 250 mg/dm^3.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

## Warning: position_stack requires constant width: output may be incorrect

The density values seem to follow a normal distribution, with most values between 0.995 and 1.0.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The pH also seems to have a normal distribution. Most red wines have a pH between 3.21 and 3.4: median 3.31 and mean 3.311.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Sulphates has outliers above 1.5 g/dm^3 and a peak around 0.6 g/dm^3.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Alcohol varies from 8.4% to 14.9%, with a major peak around 10%. Most red wines have alcohol between 9.5% and 11.1%: median 10.2% and mean 10.42%.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   5   6   7   4   8   3 
## 681 638 199  53  18  10

All quality values are integers between 3 and 8. Most red wines have a quality of 5 or 6: median 6 and mean 5.636.
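
The summary statistics can be reproduced from the frequency table alone. A minimal sketch that rebuilds the quality vector from the counts above (the object names are my own):

```r
# Quality scores and their counts, from the table above.
quality_levels <- 3:8
quality_counts <- c(10, 53, 681, 638, 199, 18)

# Rebuild the full vector of 1599 quality scores.
quality <- rep(quality_levels, times = quality_counts)

print(length(quality))          # 1599
print(median(quality))          # 6
print(round(mean(quality), 3))  # 5.636

# Percentage of wines rated 5 or 6.
print(round(100 * mean(quality %in% c(5, 6)), 1))
```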

## [1] "3" "4" "5" "6" "7" "8"
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Created a “quality_class” factor variable for bivariate and multivariate analysis.
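
A sketch of how such a factor could be built in base R, using the counts above in place of the real data. Whether the original variable was an ordered factor is not shown in the output, so ordered = TRUE is an assumption here:

```r
# Counts per quality score, from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Assumption: treat quality as an ordered factor with levels 3 to 8.
quality_class <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(quality_class))
print(table(quality_class))
```

An ordered factor keeps the levels in score order in box plots and legends.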

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines with 11 input features and 1 output feature (quality).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. I’d like to find which chemical properties influence the quality of red wine. I suspect alcohol and some other features can be used to build a predictive model of red wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, density and pH seem to contribute to quality. I think alcohol is the most significant feature because red wine is a kind of liquor.

Did you create any new variables from existing variables in the dataset?

Yes, I created a “quality_class” variable for bivariate and multivariate analysis of the “quality” feature.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The citric acid distribution has three peaks around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid, which is the highest peak.

Chlorides was log-transformed, and the transformed distribution looks approximately normal. Most red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3: median 0.079 g/dm^3 and mean 0.08747 g/dm^3.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Alcohol (0.48) and sulphates (0.25) have the strongest positive correlations with quality. Volatile.acidity (-0.39) has the strongest negative correlation with quality.
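
To make the ranking explicit, the quality column of the correlation matrix can be sorted by absolute value. The sketch below copies the values from the output above. On the full data frame the same vector could be taken from the quality column of the cor() result:

```r
# Correlation of each input feature with quality, from the matrix above.
quality_cor <- c(
  fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
  citric.acid = 0.22637251, residual.sugar = 0.01373164,
  chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.18510029, density = -0.17491923,
  pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632
)

# Rank features by the absolute size of their correlation with quality.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(head(ranked, 3), 3))
```

The top three by absolute value are alcohol, volatile.acidity and sulphates, in that order.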

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

As quality increases, the median alcohol tends to increase.

## 
## Call:
## lm(formula = quality ~ alcohol, data = wineSubset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8489 -0.4065 -0.1787  0.5176  2.5909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.81782    0.17512   10.38   <2e-16 ***
## alcohol      0.36646    0.01672   21.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7083 on 1596 degrees of freedom
## Multiple R-squared:  0.2314, Adjusted R-squared:  0.2309 
## F-statistic: 480.4 on 1 and 1596 DF,  p-value: < 2.2e-16

The linear model of quality on alcohol has an R^2 value of 0.2314.
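
To give a feel for the size of the effect, the coefficients from the summary above can be used to compute fitted quality at a few alcohol levels. This is only a sketch, and predict_quality is a hypothetical helper rather than part of the original analysis:

```r
# Coefficients reported in the lm() summary above.
intercept <- 1.81782
slope_alcohol <- 0.36646

# Hypothetical helper: fitted quality for a given alcohol percentage.
predict_quality <- function(alcohol) intercept + slope_alcohol * alcohol

# Fitted quality at the first quartile, median and third quartile of alcohol.
print(round(predict_quality(c(9.5, 10.2, 11.1)), 2))
```

Moving from the first to the third quartile of alcohol raises the fitted quality by a little under 0.6 points.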

Alcohol has a negative correlation with density.

Alcohol also has weak negative correlations with chlorides (-0.22) and total.sulfur.dioxide (-0.21), and almost none with free.sulfur.dioxide (-0.07).

As quality increases, the median density tends to decrease.

There seems to be no strong relationship between quality and fixed.acidity.

As quality increases, the median volatile.acidity tends to decrease.

As quality increases, the median citric.acid tends to increase.

The citric acid distribution has three peaks around 0, 0.25 and 0.5 g/dm^3 at each quality level.

There seems to be no strong relationship between quality and residual.sugar.

As quality increases, the median chlorides tends to decrease.

There seems to be no relationship between quality and free.sulfur.dioxide.

There seems to be no relationship between quality and total.sulfur.dioxide.

As quality increases, the median pH tends to decrease.

As quality increases, the median sulphates tends to increase.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.

Citric.acid is distributed with three peaks at 0, 0.25 and 0.5 g/dm^3 at each quality level.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol has a negative correlation with density (-0.50). Alcohol also has weak negative correlations with chlorides (-0.22) and total.sulfur.dioxide (-0.21), and almost none with free.sulfur.dioxide (-0.07).

What was the strongest relationship you found?

The quality of red wine is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.

Multivariate Plots Section

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

As quality increases, alcohol tends to increase.

Alcohol has a negative correlation with density.

There seems to be no specific relationship between alcohol and pH.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

At every quality level, citric.acid tends to have peaks at 0, 0.25 and 0.5 g/dm^3.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wineSubset)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wineSubset)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wineSubset)
## 
## ===============================================
##                      m1        m2        m3    
## -----------------------------------------------
## (Intercept)        1.818***  3.038***  2.547***
##                   (0.175)   (0.185)   (0.196)  
## alcohol            0.366***  0.319***  0.315***
##                   (0.017)   (0.016)   (0.016)  
## volatile.acidity            -1.384*** -1.221***
##                             (0.095)   (0.097)  
## sulphates                              0.685***
##                                       (0.100)  
## -----------------------------------------------
## R-squared             0.231     0.322     0.341
## adj. R-squared        0.231     0.321     0.340
## sigma                 0.708     0.666     0.656
## F                   480.388   378.330   274.938
## p                     0.000     0.000     0.000
## Log-likelihood    -1715.379 -1615.409 -1592.411
## Deviance            800.741   706.568   686.521
## AIC                3436.757  3238.818  3194.823
## BIC                3452.887  3260.324  3221.705
## N                  1598      1598      1598    
## ===============================================

This linear model (m3) has an R^2 value of 0.341. It uses the three variables with the highest absolute correlation with quality: alcohol, volatile.acidity and sulphates.
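
As a worked check on the table, adjusted R-squared can be recomputed from R-squared, the number of observations and the number of predictors. The inputs below are the rounded values from the table, so the result is approximate:

```r
n <- 1598  # observations used in all three models (row N in the table)

# R-squared and number of predictors for each model, from the table above.
r_squared <- c(m1 = 0.231, m2 = 0.322, m3 = 0.341)
n_predictors <- c(m1 = 1, m2 = 2, m3 = 3)

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)
print(round(adj_r_squared, 3))
```

With about 1600 observations and at most three predictors the adjustment is tiny, which is why the two rows of the table are almost identical.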

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a negative correlation between alcohol and density. It can be seen at every quality level in the red wine data set.

Despite the low R-squared value of 0.341, I built a linear model that predicts the quality of red wine.

Were there any interesting or surprising interactions between features?

From the scatter plot of alcohol and pH above, there does not seem to be any specific relationship between alcohol and pH.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model of quality using alcohol, volatile.acidity and sulphates. The variables were selected by the absolute value of their correlation with quality.

The variables in this linear model can account for 34.1% of the variance in the quality of red wines.
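
To show what the model does in practice, the sketch below applies the rounded m3 coefficients from the table to the first ten wines listed in the str() output at the top. The object names are my own, and the fitted values are approximate because the coefficients are rounded:

```r
# First ten wines, copied from the str() output at the top of the report.
first_rows <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Rounded m3 coefficients from the model table.
m3_coef <- c(intercept = 2.547, alcohol = 0.315,
             volatile.acidity = -1.221, sulphates = 0.685)

# Fitted quality and residual (observed minus fitted) for each wine.
first_rows$fitted <- with(
  first_rows,
  m3_coef[["intercept"]] +
    m3_coef[["alcohol"]] * alcohol +
    m3_coef[["volatile.acidity"]] * volatile.acidity +
    m3_coef[["sulphates"]] * sulphates
)
first_rows$residual <- first_rows$quality - first_rows$fitted

print(round(first_rows[, c("quality", "fitted", "residual")], 2))
```

The fitted values cluster in a narrow band, so wines rated 7 are under-predicted, which is consistent with the modest R-squared.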


Final Plots and Summary

Plot One

Description One

The citric acid distribution has three peaks around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid, which is the highest peak.

Plot Two

Description Two

The quality of red wine is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.

Plot Three

Description Three

Alcohol has a negative correlation with density, and this holds regardless of quality.


Reflection

The data set contains 1599 red variants of the Portuguese “Vinho Verde” wine. I started by understanding the individual variables in the data set, and I was interested in the “alcohol” feature because wine is a kind of liquor.

While exploring the data set, I found an interesting distribution for “citric.acid”. It has three peaks around 0, 0.25 and 0.5 g/dm^3. About 8% of the red wines have 0 “citric.acid”.

As I expected, the feature most correlated with quality is “alcohol”, and other features are related to quality as well. “sulphates” is also positively correlated with quality, and “volatile.acidity” is negatively correlated. The linear model with only the “alcohol” variable has an R-squared value of 0.231. Adding “volatile.acidity” and “sulphates” increases the R-squared value to 0.341.

“alcohol” is negatively correlated with “density” regardless of quality. As the percentage of “alcohol” increases, “density” decreases.

Since the data set consists only of samples of the specific red wine mentioned above, this analysis is limited. It would be interesting to obtain data from various regions to reduce any bias from using a single product.